m6a RNA科研设计、数据分析、结果解读 精品培训班 实操环节

除了 早期测序质控,比对后质控,还有一个质控——实验质控(带生物学意义的质控)
deeptools真的很deep

五、Sample QC (发错货这事,有时挺常见的)

1、preparation(两种模式——全基因组|特定区域)

work_dir=/home4/sjshen/project/m6a/GSE29714_Cell
hisat2_bam_dir=$work_dir/1.1_hisat2/mouse/position_sorted_bam

echo $hisat2_bam_dir
cd $hisat2_bam_dir

(nohup multiBamSummary bins  -p 60 --bamfiles $hisat2_bam_dir/*bam -out ./deeptools/results_whole_region.npz ) &
(nohup multiBamSummary BED-file --BED ./mm10.RefSeq.bed  -p 60 --bamfiles $hisat2_bam_dir/*bam -out ./deeptools_qc/results_gene_region.npz ) &

Tips1:multiBamSummary是deeptools套件的一个函数,主要用于生成下游质控的初始文件
Tips2:默认划bins是10kb一个bin
Tips3:mm10.RefSeq.bed 来自 UCSC TABLE broswer

2、Correlation

plotCorrelation -in results_gene_region.npz --corMethod spearman --skipZeros --plotTitle "Spearman Correlation of Read Counts gene region" \
--labels SYSY-MeRIP-rep1 input-rep1 NEB-MeRIP-rep2 SYSY-MeRIP-rep2 input-rep2 --colorMap PiYG \
--whatToPlot heatmap  --plotNumbers -o heatmap_SpearmanCorr_readCounts_gene_region.png --outFileCorMatrix SpearmanCorr_readCounts_gene_region.tab &

plotCorrelation -in results_gene_region.npz --corMethod spearman --skipZeros --plotTitle "Spearman Correlation of Read Counts gene region" \
--labels SYSY-MeRIP-rep1 input-rep1 NEB-MeRIP-rep2 SYSY-MeRIP-rep2 input-rep2 --log1p \
--whatToPlot scatterplot -o heatmap_SpearmanCorr_readCounts_gene_region_2.png &

Tips1:spearman 和 pearson 是最常用的方法,pearson更容易受异常值影响
Tips2:用于判断异常样本
Tips3:文件名太长可以用 --labels 修改
Tips4:可修改颜色

 

heatmap

scatterplot

3、PCA

cd  $hisat2_bam_dir/deeptools_qc

plotPCA --transpose --markers 'o' '>' 'o' 'o' '>' -in results_gene_region.npz -o PCA_readCounts_gene_region.png -T "PCA of read counts gene region" --outFileNameData PCA_readCounts_data_gene_region.tab &

plotPCA --transpose --markers 'o' '>' 'o' 'o' '>' -in results_whole_region.npz -o PCA_readCounts_whole_region.png -T "PCA of read counts whole region" --outFileNameData PCA_readCounts_data_whole_region.tab &

Tips1:PCA是一种降维方法,他将样本的诸多差异(基因|bins|...)浓缩到二维,其中PC1和PC2是区别最大的前两个维度
Tips2:默认取前1000个影响因素
Tips3:用于判断异常样本
Tips4:文件名太长可以用 --labels 修改

 

gene_region范围计算

whole_region范围计算

4、fragment length

work_dir=/home4/sjshen/project/m6a/GSE29714_Cell
hisat2_bam_dir=$work_dir/1.1_hisat2/mouse/position_sorted_bam

for i in $hisat2_bam_dir/*.bam
do
echo $i
(nohup bamPEFragmentSize -p 20 -hist ${i##*/}fragmentSize.png -T "Fragment size of SE MeIP-seq data" --maxFragmentLength 200 --bamfiles  $i  --samplesLabel  ${i##*/}) > ./${i##*/}fragmentSize.log &
done

Tips:主要用于判断片段超声情况,一般m6A大约在100-200nt


5、other

plotFingerprint
plotCoverage
#这两个函数主要检查抗体富集和reads覆盖度,仅供参考


   

六、reads distribution (m6a在基因上的分布——另一种QC)

## 1.input normalize ##  消除背景噪音

ChIP=C57BL6_Brain_Sample_1_MeRIP-SYSY.hisat2
Input=C57BL6_Brain_Sample_1_Non-IP_Control.hisat2
operation=log2

bigwigCompare -p 50  -b1 ${ChIP}.bw  -b2 ${Input}.bw \
--operation ${operation}  --binSize 10 --pseudocount 1 \
-o ${ChIP}_${operation}_${Input}.bw --outFileFormat bigwig \
> ${ChIP}_${operation}_${Input}.log &

#--------------------------------------------------------------------

## 2.prepare matrix ##

cd /home4/sjshen/project/m6a/GSE29714_Cell/1.1_hisat2/mouse/position_sorted_bam/deeptools_profile/result

bw_dir1=/home4/sjshen/project/m6a/GSE29714_Cell/1.1_hisat2/mouse/position_sorted_bam/deeptools_bw/assemble_bw
region_dir1=/home4/sjshen/project/m6a/GSE29714_Cell/1.1_hisat2/mouse/position_sorted_bam/deeptools_profile/bed_home

bw_files="
$bw_dir1/C57BL6_Brain_Sample_1_MeRIP-SYSY.hisat2_log2ratio_C57BL6_Brain_Sample_1_Non-IP_Control.hisat2.bw 
$bw_dir1/C57BL6_Brain_Sample_2_MeRIP-NEB.hisat2_log2ratio_C57BL6_Brain_Sample_2_Non-IP_Control.hisat2.bw 
$bw_dir1/C57BL6_Brain_Sample_2_MeRIP-SYSY.hisat2_log2ratio_C57BL6_Brain_Sample_2_Non-IP_Control.hisat2.bw 
"
region_files="
$region_dir1/mm10.RefSeq.cds.bed
"

try=MeIRP_profile_on_cds_mm10_input_norm

nohup computeMatrix scale-regions -p 100  -b 1000 -a 1000 --regionBodyLength 3000 -R $region_files \
-S $bw_files --startLabel cdsStart  --endLabel cdsEnd --binSize 50 \
--samplesLabel  SYSY-MeRIP-rep1 NEB-MeRIP-rep2 SYSY-MeRIP-rep2 \
-o ./${try}_computeMatrix.gz --outFileNameMatrix ./${try}_computeMatrix.matrix.tab \
--outFileSortedRegions ./${try}_computeMatrix.SortedRegions.bed > ./${try}_computeMatrix.log &

#--------------------------------------------------------------------

## 3.display graph  ##

try=MeIRP_profile_on_cds_mm10_input_norm

plotProfile --matrixFile ${try}_computeMatrix.gz --outFileName ${try}_profile_1.pdf --outFileNameData ${try}_profile_1.tab --dpi 720 --perGroup --numPlotsPerRow 2 --plotHeight 12 --plotWidth 16 &
plotProfile --matrixFile ${try}_computeMatrix.gz --outFileName ${try}_profile_2.pdf --outFileNameData ${try}_profile_2.tab --dpi 720            --numPlotsPerRow 2 --plotHeight 12 --plotWidth 16 &

Tips:主要查看基因CDS区域上带有m6a的reads的分布

# 也可以展示每个基因的情况
plotHeatmap --matrixFile ./${try}_computeMatrix.gz --outFileName ./${try}_peak.pdf --colorList "white,blue" --whatToShow "plot, heatmap and colorbar" --dpi 720 --missingDataColor white \
--refPointLabel "center" --regionsLabel " "  --xAxisLabel " " --sortRegions descend --sortUsing mean --heatmapHeight 30 --heatmapWidth 4 &

Dependencies